We consider the problem of regularized maximum likelihood estimation
for the structure and parameters of a high-dimensional, sparse directed
acyclic graphical (DAG) model with Gaussian distribution, or equivalently,
of a Gaussian structural equation model. We show that the $\ell_0$-penalized
maximum likelihood estimator of a DAG has about the same number of
edges as the minimal-edge I-MAP (a DAG with minimal number of edges
representing the distribution), and that it converges in Frobenius norm.
We allow the number of nodes $p$ to be much larger than sample size $n$
but assume a sparsity condition and that any representation of the true
DAG has at least a fixed proportion of its non-zero edge weights above the
noise level. Our results do not rely on the restrictive strong faithfulness
condition which is required for methods based on conditional independence
testing such as the PC-algorithm.



1 Introduction

Directed acyclic graphs (DAGs) and corresponding directed graphical models
are key concepts for causal inference, see for example the books by Spirtes et al.
[2000] and Pearl [2000]. From an estimation point of view, a first step consists
in estimating the Markov equivalence class of the true underlying causal DAG
based on observational data, and from this, one infers identifiable causal effects
and lower bounds for all causal effects [Maathuis et al., 2009]. This strategy has
been applied to, and to a certain extent validated using, high-throughput, and
hence high-dimensional, data in biology [Maathuis et al., 2010]. It is of primary
importance to understand limitations and potential of methods in terms of
subtle and often uncheckable assumptions, and in this respect, our results here
shed new light.

We focus here on the problem of estimating the Markov equivalence class of
DAGs (or more generally of a so-called minimal-edge I-MAP) in the setting of
observational Gaussian data where the number $p$ of variables or nodes in the
DAG may greatly exceed sample size $n$. We consider the $\ell_0$-penalized
maximum likelihood estimator, and we relate and compare our new results and
conditions to the popular PC-algorithm [Spirtes et al., 2000] and its theoretical
analysis. To the best of our knowledge, the latter is so far the only work
providing theoretical guarantees for inferring the Markov equivalence class of
DAGs in the high-dimensional setting. We emphasize that the popular $\ell_1$-norm
regularization of the likelihood is inappropriate here, leading to an objective
function which is not constant over equivalent DAGs encoding the same
distribution. On the other hand, $\ell_0$-penalization leads to invariant scores over
equivalent DAGs. The computational difficulties are primarily due to the
non-convex constraint that the directed graph should be acyclic, and the
additional complication of the $\ell_0$- in comparison to e.g. $\ell_1$-norm penalization is
rather marginal. A computationally feasible algorithm for exact $\ell_0$-penalized
maximum likelihood estimation for the Markov equivalence class of DAGs has
been proposed by Silander and Myllymäki [2006]; for larger graphs, with more
than about 20 nodes, approximate algorithms can be used [Chickering, 2002,
Hauser and Bühlmann, 2011].



1.1 Relation to other work 



We analyze the $\ell_0$-penalized maximum likelihood estimator for the equivalence
class of DAGs in the Gaussian setting. Pioneering work for the low-dimensional
case on this problem has been done by Chickering [2002], who proved consistency
of the BIC-score and provided an algorithm, called greedy equivalence search
(GES), which greedily proceeds in the space of Markov equivalence classes.
While the GES-algorithm can also be used in the high-dimensional scenario
[Hauser and Bühlmann, 2011], the asymptotic consistency of BIC is established
only for the case with a fixed distribution (with $p < \infty$) where the sample size
$n \to \infty$. Chickering's first significant analysis does not provide any insights
for the high-dimensional case with its subtle interplay of signal strength, noise
level and identifiability conditions.

Robins et al. [2003] present refined analyses for causal inference under the
view point of uniform consistency as sample size $n \to \infty$. There, problematic
issues with the so-called faithfulness condition (see Section 1.2) arise, and
Zhang and Spirtes [2003] introduce the notion of strong faithfulness (see (2))
as a way to address some of the major problems raised. None of these works
considers high-dimensional inference, but their emphasis on the faithfulness
condition and its strong version is important.



Kalisch and Bühlmann [2007] provide consistency results for the PC-algorithm
[Spirtes et al., 2000] for estimating the Markov equivalence class of DAGs based
on Gaussian observational data, in the high-dimensional, sparse setting. One of
the conditions used is a restricted version of strong faithfulness in (2) modified
for sparse problems, see (3). Our analysis with the penalized MLE is completely
different and circumvents such strong faithfulness conditions, which are often
very restrictive as shown by Uhler et al. [2012].



The theoretical high-dimensional analysis presented here is very different and
more challenging than for multivariate regression or covariance estimation, due
to the unknown order among the variables. For known order, as e.g. in time
series problems, Shojaie and Michailidis [2010] present results for estimation of
high-dimensional DAGs; but the case with unknown order considered in the
present paper requires major new theoretical ideas and development.

1.2 Directed graphical and structural equation models



Consider the following model. There is a DAG $D_0$ whose $p$ nodes correspond
to random variables $X_1, \ldots, X_p$; assume that

$X_1, \ldots, X_p \sim \mathcal{N}_p(0, \Sigma_0)$ with density $f_{\Sigma_0}(\cdot)$,

$\mathcal{N}_p(0, \Sigma_0)$ is Markovian with respect to $D_0$, (1)

where the Markov property can be understood as the factorization property
where the joint Gaussian density $f_{\Sigma_0}(x_1, \ldots, x_p)$ can be factorized as

$$f_{\Sigma_0}(x_1, \ldots, x_p) = \prod_{j=1}^p f_{\Sigma_0}(x_j \mid x_{\mathrm{pa}(j)}),$$

with $\mathrm{pa}(j)$ denoting the set of parents of node $j$ [cf. Lauritzen, 1996].



It is well-known that in general, there exists another DAG $D$ such that the
distribution $\mathcal{N}(0, \Sigma_0)$ is Markovian with respect to $D$. The set of all such
DAGs forms the so-called Markov equivalence class $\mathcal{E}(D_0)$, which can be
characterized in terms of a bi-directed graph [cf. Andersson et al., 1997]. The
Markov equivalence class $\mathcal{E}(D_0)$ can be identified from the observational data
distribution $\mathcal{N}(0, \Sigma_0)$ under the assumption of faithfulness.



Definition of faithfulness [cf. Spirtes et al., 2000]: For a DAG $D$, a distribution
$P$ is called faithful with respect to $D$ if and only if all conditional independences
are encoded by the DAG $D$.

Faithfulness is stronger than a Markov assumption: the latter allows one to infer
some conditional independences from the DAG while the former requires that all
of them can be inferred from the DAG (i.e. also the ones which are entailed by a 
Markov condition). Failure of faithfulness is "rare", having Lebesgue measure 
zero, if the edge weights (the coefficients in the equivalent linear structural 
equation model) are chosen from a distribution which is absolutely continuous 
with respect to Lebesgue measure. However, for statistical estimation, we often 
require sufficiently strong detectability of conditional dependencies, given by 
the notion of strong faithfulness. 



Definition of strong faithfulness in the Gaussian case [Zhang and Spirtes, 2003]:
For a DAG $D$, a Gaussian distribution $P$ is called $\tau$-strongly faithful with respect
to $D$ if and only if

$$\min\{ |\mathrm{Parcorr}(X_j, X_k \mid X_S)| ;\ \mathrm{Parcorr}(X_j, X_k \mid X_S) \neq 0,\ S \subseteq \{1, \ldots, p\} \setminus \{j, k\},\ j, k \in \{1, \ldots, p\}\ (j \neq k) \} > \tau. \quad (2)$$
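For small $p$, the minimum in (2) can be evaluated by brute force. The following is a minimal sketch, not part of the analysis: the function names, the numerical threshold `tol` used to decide which partial correlations count as zero, and the 3-node example covariance are all assumptions made for illustration. The optional `max_size` argument restricts the conditioning sets to $|S| \le$ `max_size`.

```python
from itertools import combinations

import numpy as np


def parcorr(Sigma, j, k, S):
    """Partial correlation of X_j and X_k given X_S, read off the inverse
    of the covariance submatrix on {j, k} united with S."""
    idx = [j, k] + list(S)
    P = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    return -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])


def min_nonzero_parcorr(Sigma, max_size=None, tol=1e-10):
    """Smallest non-zero |Parcorr(X_j, X_k | X_S)| over j != k and all S
    with |S| <= max_size. `tol` is a numerical stand-in for 'non-zero'."""
    p = Sigma.shape[0]
    if max_size is None:
        max_size = p - 2
    best = np.inf
    for j, k in combinations(range(p), 2):
        rest = [i for i in range(p) if i not in (j, k)]
        for size in range(min(max_size, len(rest)) + 1):
            for S in combinations(rest, size):
                r = abs(parcorr(Sigma, j, k, S))
                if r > tol:
                    best = min(best, r)
    return best


# Hypothetical 3-node chain with correlations 0.5^{|i-j|}.
Sigma = np.array([[1.0, 0.5, 0.25],
                  [0.5, 1.0, 0.5],
                  [0.25, 0.5, 1.0]])
tau_bound = min_nonzero_parcorr(Sigma)
```

In this toy example the inequality in (2) holds for every $\tau$ below `tau_bound`. The enumeration is exponential in $p$, so it is only meant to make the definition concrete.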



A typical requirement is strong faithfulness with $\tau \asymp \sqrt{\text{sparsity} \cdot \log(p)/n}$, see
also below for the PC-algorithm. Strong faithfulness can be viewed as a
condition of "signal strength" in terms of non-zero partial correlations. As shown in
Uhler et al. [2012], strong faithfulness is a very restrictive condition for many
DAGs. At the same time, it is essentially unavoidable for any algorithm for
inferring the Markov equivalence class $\mathcal{E}(D_0)$ which relies on conditional
independence testing. The most prominent example is the PC-algorithm [Spirtes et al.,
2000]: consistent estimation for the Markov equivalence class $\mathcal{E}(D_0)$ is proved
[Kalisch and Bühlmann, 2007] for the Gaussian case assuming a restricted
strong faithfulness condition

$$\min\{ |\mathrm{Parcorr}(X_j, X_k \mid X_S)| ;\ \mathrm{Parcorr}(X_j, X_k \mid X_S) \neq 0,\ S \subseteq \{1, \ldots, p\} \setminus \{j, k\},\ |S| \le d,\ j, k \in \{1, \ldots, p\}\ (j \neq k) \} > \tau, \quad (3)$$

where $d$ is the maximal degree of the skeleton of $D_0$ (the undirected graph
when deleting all arrow directions in $D_0$); and essentially for $\tau \asymp \sqrt{d \log(p)/n}$.
In comparison to (2), the restriction is exploiting sparsity by looking only at
sets $S$ with $|S| \le d$, see Bühlmann and van de Geer [2011, Th. 13.1]; but also
such restricted strong faithfulness remains a stringent condition [Uhler et al.,
2012]. The results in this paper for the $\ell_0$-penalized MLE do not require a strong
faithfulness condition as in (2) or (3): the reason is that the method is not
relying on conditional independence testing but rather on penalized parameter
estimation in terms of a linear structural equation model as explained next.

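A Gaussian DAG model as in (1) can always be equivalently represented as a
linear structural equation model:

$$X_j = \sum_{k \in \mathrm{pa}(j)} \beta^0_{k,j} X_k + \epsilon_j \quad (j = 1, \ldots, p), \quad (4)$$

where $\epsilon_1, \ldots, \epsilon_p$ are independent, $\epsilon_j \sim \mathcal{N}(0, |\omega_j^0|^2)$ and $\epsilon_j$ independent of $\{X_k;\ k \in \mathrm{pa}(j)\}$;
note that $\mathrm{pa}(j) = \mathrm{pa}_{D_0}(j)$ depends on the true DAG $D_0$.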



2 The setting and the estimator 

We use here and in the sequel a terminology which does not rely on the standard 
language from graphical modeling since the required basics for the Gaussian case 
(see models (1) and (4)) can be developed in a straightforward way.

We consider $n$ i.i.d. observations from the structural equation model (4), which
is equivalent to model (1). We denote by $X := (X_1, \ldots, X_p)$ the $n \times p$ data
matrix with $n$ i.i.d. rows, each of them being $\mathcal{N}(0, \Sigma_0)$-distributed, where $\Sigma_0$ is
a non-singular covariance matrix. The relations between the variables in a row
can be represented as

$X = X B_0 + E$, (5)

where $B_0 := (\beta^0_{k,j})$ is a $p \times p$ matrix with $\beta^0_{j,j} = 0$ for all $j$, and where $E$ is
an $n \times p$ matrix of noise vectors $E := (\epsilon_1, \ldots, \epsilon_p)$, with $\epsilon_j$ independent of $X_k$
whenever $\beta^0_{k,j} \neq 0$.^1 Furthermore, $E$ has $n$ i.i.d. rows which are $\mathcal{N}(0, \Omega_0)$-distributed,
with $\Omega_0 := \mathrm{diag}(|\omega_1^0|^2, \ldots, |\omega_p^0|^2)$ a $p \times p$ diagonal matrix.

The model (5) implies that

$$\Sigma_0 = [(I - B_0)^{-1}]^T \Omega_0 [(I - B_0)^{-1}].$$

We call $(B_0, \Omega_0)$ a DAG corresponding to $\Sigma_0$.^2 The set of edges of this DAG is
denoted by $S_0 := S_{B_0} := \{(k, j) : \beta^0_{k,j} \neq 0\}$, and in fact, $\beta^0_{k,j} \neq 0$ encodes a
directed edge $k \to j$. We will assume in Condition 3.4 that $(B_0, \Omega_0)$ is sparse,
in the sense that its number of (directed) edges $s_0 := |S_0|$ is small.
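The covariance identity implied by (5) can be checked numerically. The sketch below is illustrative only: the 3-node DAG, its edge weights, the noise variances and the function name `implied_covariance` are hypothetical choices. It builds $\Sigma_0$ from $(B_0, \Omega_0)$ and compares it with the empirical covariance of data simulated from $X = X B_0 + E$.

```python
import numpy as np


def implied_covariance(B, omega2):
    """Sigma = [(I - B)^{-1}]^T diag(omega2) [(I - B)^{-1}] for X = X B + E."""
    p = B.shape[0]
    A = np.linalg.inv(np.eye(p) - B)
    return A.T @ np.diag(omega2) @ A


# Hypothetical DAG with edges 1 -> 2, 1 -> 3, 2 -> 3; B0[k, j] != 0 encodes k -> j.
B0 = np.array([[0.0, 0.8, -0.5],
               [0.0, 0.0, 0.7],
               [0.0, 0.0, 0.0]])
omega2 = np.array([1.0, 0.5, 2.0])      # noise variances |omega_j^0|^2
Sigma0 = implied_covariance(B0, omega2)

# Simulate n i.i.d. rows: X (I - B0) = E, hence X = E (I - B0)^{-1}.
rng = np.random.default_rng(0)
n = 200_000
E = rng.normal(size=(n, 3)) * np.sqrt(omega2)
X = E @ np.linalg.inv(np.eye(3) - B0)
Sigma_n = X.T @ X / n                   # empirical covariance (mean is zero)
s0 = int(np.count_nonzero(B0))          # number of directed edges
```

With this sample size the entries of `Sigma_n` should agree with `Sigma0` up to sampling error, and `s0` counts the three directed edges.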

As described in Section 1.2 with the concept of a Markov equivalence class, there
are several DAGs $(B_0, \Omega_0)$ describing the same $\Sigma_0$ and thus the same Gaussian
distribution $P = \mathcal{N}(0, \Sigma_0)$. Throughout this paper, $(B_0, \Omega_0)$ is defined as a
DAG with a minimal number of edges. We call such a DAG a minimal-edge
I-MAP.^3

2.1 The $\ell_0$-penalized maximum likelihood estimator

We use a penalized maximum likelihood procedure to estimate the DAG $(B_0, \Omega_0)$.
Let

$$\hat\Sigma_n := X^T X / n$$

be the empirical covariance matrix based on the observations $X$. Given a
$p \times p$ non-singular covariance matrix $\Sigma$, with inverse $\Theta := \Sigma^{-1}$, the minus
log-likelihood is proportional to

$$l_n(\Theta) := \mathrm{trace}(\Theta \hat\Sigma_n) - \log\det(\Theta).$$

We consider inverse covariance matrices that can be represented as a DAG.
That is, we let

$$\Theta := \Theta(B, \Omega) := (I - B) \Omega^{-1} (I - B)^T,$$

where $(B, \Omega)$ is a DAG. The latter means that $\Omega$ is a positive diagonal matrix
and that, up to a permutation $\pi$ of the rows and columns, $B$ can be written as
a lower-diagonal matrix.

The $\ell_0$-penalized maximum likelihood estimator is

$$\hat\Theta = \arg\min\{ l_n(\Theta) + \lambda^2 s_B :\ \Theta = \Theta(B, \Omega) \text{ for some DAG } (B, \Omega) \text{ with } B \in \mathcal{B} \}. \quad (6)$$



^1 Note that in relation to the true DAG $D_0$ in model (1), $\beta^0_{k,j} = 0$ for $k \notin \mathrm{pa}_{D_0}(j)$. We
do not make such explicit constraints here since we aim for a smallest DAG representing the
distribution of $X$.

^2 This deviates from the classical definition where the DAG is only a (directed acyclic)
graph; we use the short terminology "DAG" for the whole graphical model with the
distribution and the graph encoded by the coefficient matrix $B$ and the error variances $\Omega$.

^3 It is a minimal I-MAP [Spirtes et al., 2000, Sec. 2.3.1] with the additional property that
it has minimal number of edges.



Here $s_B$ is the number of non-zero elements in $B$ (corresponding to the number
of edges in the DAG) and $\lambda > 0$ is a tuning parameter. The estimator is
denoted by $\hat\Theta := \Theta(\hat B, \hat\Omega)$. It has $\hat s := s_{\hat B}$ edges. The collection $\mathcal{B}$ is the set of
all edge weights $B$ of DAGs $(B, \Omega)$ which have at most $\alpha n / \log p$ incoming edges
(parents) at each node, where $\alpha > 0$ is given (see Condition 3.3), or a subset
thereof. We discuss this restriction in Subsection 4.2. We assume throughout
that $B_0 \in \mathcal{B}$, i.e. that the restrictions one puts on the edge weights are
correct.
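The objective in (6) can be written down directly for a single candidate DAG. The following is a minimal sketch under stated assumptions: the helper names (`theta`, `neg_loglik`, `is_dag`, `penalized_score`) are hypothetical, the acyclicity check peels off source nodes one layer at a time, and the search over DAGs and the restriction $B \in \mathcal{B}$ are not implemented.

```python
import numpy as np


def theta(B, omega2):
    """Theta(B, Omega) = (I - B) Omega^{-1} (I - B)^T, Omega = diag(omega2)."""
    p = B.shape[0]
    M = np.eye(p) - B
    return M @ np.diag(1.0 / omega2) @ M.T


def neg_loglik(Theta, Sigma_n):
    """l_n(Theta) = trace(Theta Sigma_n) - log det(Theta)."""
    _, logdet = np.linalg.slogdet(Theta)
    return np.trace(Theta @ Sigma_n) - logdet


def is_dag(B):
    """True if the support of B (B[k, j] != 0 meaning k -> j) has no cycle."""
    A = (B != 0)
    remaining = list(range(B.shape[0]))
    while remaining:
        # nodes without incoming edges from the remaining nodes
        sources = [j for j in remaining if not A[remaining, j].any()]
        if not sources:
            return False
        remaining = [j for j in remaining if j not in sources]
    return True


def penalized_score(B, omega2, Sigma_n, lam):
    """l_n(Theta(B, Omega)) + lam^2 * s_B for one candidate DAG (B, Omega)."""
    if not is_dag(B):
        raise ValueError("the support of B contains a directed cycle")
    return neg_loglik(theta(B, omega2), Sigma_n) + lam ** 2 * np.count_nonzero(B)
```

Minimizing this score over all DAGs is the hard part; the sketch only evaluates it at a given $(B, \Omega)$.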

The $\ell_0$-penalty in the estimator ensures that the penalized likelihood remains
the same among all equivalent representations, e.g., among all DAGs from the
same Markov equivalence class or among the equivalence class described in
Remark 2.1 below. This would not be true when choosing for example an $\ell_1$-norm
penalization, see also Remark 2.2. From a computational point of view, the
$\ell_0$-penalty does not make the problem substantially more complex since the main
difficulty is the optimization over the space of DAGs $\mathcal{B}$. For problems with up to about
$p \approx 20$ nodes, dynamic programming can be used [Silander and Myllymäki,
2006], while for large-scale applications, greedy equivalence search has been
reported to perform well [Chickering, 2002, Hauser and Bühlmann, 2011].



Remark 2.1 We call two DAGs $(B_1, \Omega_1)$ and $(B_2, \Omega_2)$ equivalent if
$\Theta(B_1, \Omega_1) = \Theta(B_2, \Omega_2)$ and if in addition they have the same number of
undirected edges.^4 In our analysis, we will identify DAGs which are in the same
equivalence class. Thus, our aim is to estimate an arbitrary member of the
equivalence class of $(B_0, \Omega_0)$, by a suitable member in the equivalence class of
$(\hat B, \hat\Omega)$.



2.1.1 The main results and their implications 

We show in Theorem 3.1 that with a choice of the tuning parameter $\lambda^2$ of order
$(\log p / n)((p/s_0) \vee 1)$, the number of edges $\hat s$ of the estimator is of the same order of
magnitude as the number of edges $s_0$ of the true DAG. Moreover, we show that
$\hat B$ and $\hat\Omega$ converge in Frobenius norm to some $\tilde B_0$ and $\tilde\Omega_0$ respectively, where
$\Theta(\tilde B_0, \tilde\Omega_0) = \Theta_0$ is a representation of the true DAG with $\tilde s$ edges (see Section
2.2), and with $\tilde s$ of the same order of magnitude as $s_0$. The rate of convergence
is of order $\lambda^2 s_0$. To arrive at this result, we need that at least a fixed proportion
of the non-zero coefficients of any representation of the true DAG is above the
"noise level", the latter being of order $\sqrt{\log p / n}\,(\sqrt{p/s_0} \vee 1)$ (see Condition
3.5): in analogy to regression, we call this the "beta-min" condition. Some of
our other conditions are trivially satisfied when $p = O(n/\log(n))$ is sufficiently
small.

^4 This definition is not the same as for a Markov equivalence class.

The "noise level" indicates two regimes for $(n, p, s_0)$. If $s_0$ is at least of the
order of $p$ (or larger), then the "noise level" is of the order $\sqrt{\log(p)/n}$, which is
small even if $p$ is very large relative to $n$. This scenario is often realistic, saying
that a fixed non-zero proportion of the nodes has at least one parent: we call it
the standard sparsity regime. The other scenario is for $s_0 \ll p$, corresponding
to very sparse DAGs, which we call the ultra-high sparsity regime. The reason
for the two different noise level scenarios is that we estimate $p$ error variances in
$\Omega_0$: when $s_0 \ll p$, the term from estimating these error variances is dominating.
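The two regimes can be made concrete with a little arithmetic. In this sketch the function names and the constant factor (taken to be 1) are assumptions, since the theory only fixes the order of magnitude $\sqrt{\log p / n}\,(\sqrt{p/s_0} \vee 1)$.

```python
import math


def lambda_squared(n, p, s0):
    """Order of the tuning parameter: (log p / n) * max(p / s0, 1)."""
    return (math.log(p) / n) * max(p / s0, 1.0)


def noise_level(n, p, s0):
    """Order of the noise level: sqrt(log p / n) * max(sqrt(p / s0), 1)."""
    return math.sqrt(lambda_squared(n, p, s0))


# Hypothetical sizes: p = 500 nodes, n = 1000 observations.
standard = noise_level(n=1000, p=500, s0=500)   # s0 of the order of p
ultra = noise_level(n=1000, p=500, s0=5)        # s0 much smaller than p
```

Here `ultra` exceeds `standard` by the factor $\sqrt{p/s_0} = 10$, which reflects the cost of estimating the $p$ error variances when there are very few edges.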

When making a more stringent "beta-min" condition and choosing the
regularization parameter of larger order than the "noise level" (or the same order
if $s_0 = O(1)$), we can recover the minimal-edge I-MAP. When assuming in
addition the faithfulness condition (but not requiring strong faithfulness), see
Section 1.2, we can recover the true underlying Markov equivalence class. This
then allows one to derive bounds for causal inference, exactly along the lines of
Maathuis et al. [2009].



The $\ell_0$-penalized MLE can be easily adapted to the case where the noise
variances in $\Omega_0$ are known up to a scalar, for example when all noise variances are
known to be equal but their value is unknown. Then, the true DAG can be
identified from the distribution [Peters and Bühlmann, 2012]. We show that
in this case, under an identifiability condition on the noise variances, the
$\ell_0$-penalized maximum likelihood estimator finds the representation (and hence
the true DAG) with the prescribed noise variances, and the rate of convergence
for the Frobenius norm is of order $\lambda^2 s_0$ (see Theorem 5.1). We assume in this
context that $p$ is sufficiently smaller than $n/\log n$.

Remark 2.2 The $\ell_0$-penalty $\lambda^2 s_B$ is proportional to the number of edges $s_B$ of
$B$. Alternatively, inspired by the Lasso, one may consider applying a penalty
proportional to the $\ell_1$-norm

$$\|B\|_1 := \sum_{k,j} |\beta_{k,j}|.$$

However, the $\ell_1$-penalized likelihood $l_n(\Theta(B, \Omega)) + \lambda \|B\|_1$ will not be constant
within a class of equivalent DAGs. This makes the $\ell_1$-norm penalty less
suitable for estimation of DAGs in general. Under the assumption of equal noise
variances, however, the $\ell_1$-penalty may be a useful alternative regularizer since
the true DAG is identifiable, see Section 5. We omit the details.
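A two-node example makes the lack of invariance of the $\ell_1$-norm concrete. The numbers below are hypothetical: the edge weight is 2 and both noise variances are 1. The sketch constructs the two DAGs $1 \to 2$ and $2 \to 1$ that represent the same covariance matrix and compares their $\ell_0$ and $\ell_1$ penalties.

```python
import numpy as np


def theta(B, omega2):
    """Theta(B, Omega) = (I - B) Omega^{-1} (I - B)^T, Omega = diag(omega2)."""
    p = B.shape[0]
    M = np.eye(p) - B
    return M @ np.diag(1.0 / omega2) @ M.T


# Forward representation: X1 ~ N(0, 1), X2 = 2 * X1 + eps2 with Var(eps2) = 1.
B_fwd = np.array([[0.0, 2.0],
                  [0.0, 0.0]])
om_fwd = np.array([1.0, 1.0])
Sigma = np.linalg.inv(theta(B_fwd, om_fwd))

# Reverse representation of the same Sigma: regress X1 on X2.
c = Sigma[0, 1] / Sigma[1, 1]
B_rev = np.array([[0.0, 0.0],
                  [c, 0.0]])            # B[1, 0] != 0 encodes the edge 2 -> 1
om_rev = np.array([Sigma[0, 0] - c * Sigma[0, 1], Sigma[1, 1]])

same_distribution = np.allclose(theta(B_fwd, om_fwd), theta(B_rev, om_rev))
l0 = (np.count_nonzero(B_fwd), np.count_nonzero(B_rev))
l1 = (np.abs(B_fwd).sum(), np.abs(B_rev).sum())
```

Both DAGs have one edge, so the $\ell_0$-penalty agrees, while the $\ell_1$-norms are 2 and 0.4: an $\ell_1$-penalized criterion would prefer the reverse direction although the likelihood is identical.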

2.2 Permutations and the order of the variables 

The model (5) can be written as

$$X_j = X \beta_j^0 + \epsilon_j, \quad j = 1, \ldots, p,$$

with $\beta_j^0$ the $j$-th column of $B_0$. Let us write for any vector $\beta \in \mathbb{R}^p$,

$$\|X\beta\|^2 := \beta^T \Sigma_0 \beta, \quad \|X\beta\|_n^2 := \beta^T \hat\Sigma_n \beta.$$

For a permutation $\pi$ of $\{1, \ldots, p\}$, which plays the role of an order of the
variables, we let $B_0(\pi)$ be the matrix obtained by doing a Gram-Schmidt
orthogonalization for $\|\cdot\|$, starting with $X_{\pi_p}$ and finishing by projecting $X_{\pi_1}$ on
$X_{\pi_2}, \ldots, X_{\pi_p}$. Moreover, we let $\Omega_0(\pi)$ be the diagonal matrix of the error
variances. Note thus that $B_0(\pi)$ is lower-diagonal after permutation of its rows and
columns. Furthermore,

$$\Theta_0 = \Theta(B_0(\pi), \Omega_0(\pi)), \quad \forall \pi.$$

The set of incoming edges at node $j$ (non-zero coefficients of the $j$th column
of $B_0(\pi)$) is denoted by $S_j(\pi)$, and we let $s_j(\pi) := |S_j(\pi)|$. Moreover, we let
$s(\pi) := \sum_{j=1}^p s_j(\pi)$ be the total number of edges of $B_0(\pi)$. Thus, $s(\pi) = s_{B_0(\pi)}$.
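The construction of $(B_0(\pi), \Omega_0(\pi))$ amounts to a sequence of population regressions, each variable being projected on the variables that come later in the order $\pi$. The sketch below is an illustration, not the authors' code: the names `dag_for_order` and `theta` and the randomly generated covariance matrix are assumptions, and indices are 0-based. It also checks that every order reproduces the same $\Theta_0$.

```python
from itertools import permutations

import numpy as np


def dag_for_order(Sigma, order):
    """Return (B(pi), omega2(pi)): X_{pi_k} is projected on X_{pi_{k+1}}, ..., X_{pi_p}."""
    p = Sigma.shape[0]
    B = np.zeros((p, p))
    omega2 = np.zeros(p)
    for pos, j in enumerate(order):
        later = list(order[pos + 1:])
        if later:
            coef = np.linalg.solve(Sigma[np.ix_(later, later)], Sigma[later, j])
            B[later, j] = coef                      # column j holds the incoming edges of node j
            omega2[j] = Sigma[j, j] - Sigma[j, later] @ coef
        else:
            omega2[j] = Sigma[j, j]                 # last variable in the order has no parents
    return B, omega2


def theta(B, omega2):
    """Theta(B, Omega) = (I - B) Omega^{-1} (I - B)^T."""
    M = np.eye(B.shape[0]) - B
    return M @ np.diag(1.0 / omega2) @ M.T


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma0 = A @ A.T + np.eye(4)                        # hypothetical non-singular covariance
Theta0 = np.linalg.inv(Sigma0)
invariant = all(
    np.allclose(theta(*dag_for_order(Sigma0, pi)), Theta0)
    for pi in permutations(range(4))
)
```

For a generic covariance matrix such as this one, every order yields a full DAG; sparsity of $B_0(\pi)$ only appears when $\Sigma_0$ carries conditional independences.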

Example 2.1 (AR(1)-model). Suppose the true DAG is a directed chain from
$X_p$ along $X_{p-1}, \ldots, X_2$ to $X_1$ with a corresponding structural equation model:

$X_p = \epsilon_p$,

$X_j = \beta^0 X_{j+1} + \epsilon_j \quad (j = 1, \ldots, p-1)$,

where $\epsilon \sim \mathcal{N}(0, \Omega_0)$ with $\Omega_0 = \mathrm{diag}(1 - (\beta^0)^2, \ldots, 1 - (\beta^0)^2, 1)$ and $|\beta^0| < 1$.
The error variances are chosen such that $\mathrm{Var}(X_j) = 1$ for all $j$. The
covariance matrix is of Toeplitz form $(\Sigma_0)_{i,j} = (\beta^0)^{|i-j|}$ and the model satisfies
the directed global Markov property (which is equivalent to the concept of
d-separation) [cf. Lauritzen, 1996, Sec. 3.2.2]. Therefore, we have that projecting

$X_{\pi_k}$ on $X_{\pi_{k+1}}, \ldots, X_{\pi_p}$ $(k = 1, \ldots, p-1)$

leads to at most two non-zero regression coefficients in every column of $B_0(\pi)$
(corresponding to the largest index $j_1 < \pi_k$ and smallest index $j_2 > \pi_k$ if
$\pi_{k+1}, \ldots, \pi_p$ contains indices smaller and larger than $\pi_k$; or corresponding
to the largest $j < \pi_k$ if $\pi_{k+1}, \ldots, \pi_p$ contains only smaller indices than $\pi_k$; or
corresponding to the smallest $j > \pi_k$ if $\pi_{k+1}, \ldots, \pi_p$ contains only larger
indices than $\pi_k$). Thus, we have that $s_j(\pi) \le 2$ for all $j$ and all $\pi$ and hence
Condition 3.4, given below, holds.

The absolute values of the non-zero coefficients in $B_0(\pi) = (\beta_{k,j}(\pi))$ decrease
monotonely as the index-distance $d(j) = \min_{k=j+1,\ldots,p} |\pi_k - \pi_j|$ increases. Thus,
whenever $d(j) > \Delta$ for some (large) value of $\Delta$, there are at most two (since
$s_j(\pi) \le 2$) coefficients with $|\beta_{k,j}(\pi)| < C(\Delta)$ for some value $C(\Delta)$ (which
decreases as $\Delta$ increases). Therefore, clearly, there are at most $2(\lfloor p/\Delta \rfloor + 1)$
coefficients (edges) whose values satisfy $|\beta_{k,j}(\pi)| < C(\Delta)$, and all
other non-zero coefficients (at least $p - \lfloor p/\Delta \rfloor - 2$)^5 are larger than $C(\Delta)$. For
e.g. $\Delta = 3$ this implies that there are at most $2(\lfloor p/3 \rfloor + 1)$ edges with
non-zero coefficients being smaller than $C(\Delta)$, and at least $p - \lfloor p/3 \rfloor - 2$ edges with
non-zero coefficients larger than $C(\Delta)$. This implies that Condition 3.5, given
below, holds.
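The claim that every order produces at most two parents per node in the AR(1) example can be verified by exhaustive enumeration for a small chain. The sketch below uses hypothetical values ($p = 5$, coefficient 0.5) and a numerical threshold `tol` for deciding which population regression coefficients are non-zero; indices are 0-based.

```python
from itertools import permutations

import numpy as np


def incoming_edge_counts(Sigma, order, tol=1e-10):
    """s_j(pi): number of non-zero coefficients when X_{pi_k} is projected
    on X_{pi_{k+1}}, ..., X_{pi_p}; `tol` decides what counts as non-zero."""
    p = Sigma.shape[0]
    counts = np.zeros(p, dtype=int)
    for pos, j in enumerate(order):
        later = list(order[pos + 1:])
        if later:
            coef = np.linalg.solve(Sigma[np.ix_(later, later)], Sigma[later, j])
            counts[j] = int(np.sum(np.abs(coef) > tol))
    return counts


p, beta = 5, 0.5
idx = np.arange(p)
Sigma0 = beta ** np.abs(idx[:, None] - idx[None, :])    # Toeplitz covariance beta^{|i-j|}

max_parents = max(
    int(incoming_edge_counts(Sigma0, pi).max()) for pi in permutations(range(p))
)
chain_edges = int(incoming_edge_counts(Sigma0, tuple(range(p))).sum())
```

Over all $5! = 120$ orders the largest number of incoming edges at a node is 2, and the natural order recovers the chain with $p - 1 = 4$ edges.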

^5 There are at least $p - (\lfloor p/\Delta \rfloor + 1)$ indices (nodes) $j$ with $d(j) \le \Delta$; and there is at least
one non-zero coefficient (edge) from all of them except one (the starting node).

Let $\hat B$ (or one of the members in its equivalence class) be lower-diagonal after
permutation $\hat\pi$, and define $\tilde B_0 := B_0(\hat\pi)$. The number of edges of $\tilde B_0$ is denoted
by $\tilde s = s(\hat\pi)$. The DAGs $(\hat B, \hat\Omega)$ and $(\tilde B_0, \tilde\Omega_0)$ share the same lower-diagonal
structure (but not necessarily the same set of zero coefficients). We will show
that $\tilde s := s(\hat\pi)$ is with large probability of the same order of magnitude as $s_0$
(see Theorem 3.1). Thus, if the true DAG $(B_0, \Omega_0)$ is sparse, then with large
probability the DAG $(B_0(\hat\pi), \Omega_0(\hat\pi))$ is sparse as well, which means that on
average, the number of incoming edges at a node is small.

Note that $\hat\pi$ is a random permutation and that there are in total a large
number (namely $p!$) of permutations. Analytical control over these $p!$ permutations
requires a very different technique than dealing with known order [Shojaie and Michailidis,
2010] or with multivariate regression or covariance estimation problems. We
explain this in more detail in Section 7.3.1.



3 Conditions and main result 

We write $\Sigma_0 =: (\sigma_{k,j})$ and we let $\sigma_j^2 := \sigma_{j,j}$, $j = 1, \ldots, p$.

Condition 3.1 For some constant $\sigma_0^2$, it holds that

$$\max_{1 \le j \le p} \sigma_j^2 \le \sigma_0^2.$$

Condition 3.2 The smallest eigenvalue $\Lambda_{\min}^2$ of $\Sigma_0$ is non-zero. (See also (7).)

Condition 3.3 For a given constant $\alpha > 0$, it holds for any $B = (\beta_1, \ldots, \beta_p) \in \mathcal{B}$,
where $\mathcal{B}$ is as in (6), that $s_{\beta_j} \le \alpha n / \log p$ for all $j = 1, \ldots, p$, where for a
vector $\beta \in \mathbb{R}^p$ we denote the cardinality of its support set by $s_\beta := \#\{\beta_k \neq 0\}$.

Condition 3.4 For some constant $\tilde\alpha$ and any permutation $\pi$, and all $j$,

$$s_j(\pi) \le \tilde\alpha n / \log p,$$

where $s_j(\pi) = s_{\beta_j^0(\pi)}$ is the number of incoming edges of the DAG $(B_0(\pi), \Omega_0(\pi))$
at node $j$. (See also Section 4.3.1.)

Condition 3.5 There exist constants $0 \le \eta_1 < 1$ and $0 < \eta_0^2 < 1 - \eta_1$, such
that for any permutation $\pi$, the DAG $(B_0(\pi), \Omega_0(\pi))$ (which has $s(\pi)$ edges) has
at least $(1 - \eta_1) s(\pi)$ edges $(k, j)$ with $|\beta^0_{k,j}(\pi)| > \sqrt{\log p / n}\,(\sqrt{p/s_0} \vee 1)/\eta_0$.



Following Bühlmann and van de Geer [2011], we refer to Condition 3.5 as the
"beta-min" condition. It is a "kind of replacement" of the strong faithfulness
condition in (2) that is required for consistency of the PC algorithm and variants
thereof, see Section 1.2. A detailed discussion about the assumptions is given
in Section 4.

In the current section, we will present an asymptotic formulation for clarity. We
will provide a non-asymptotic result in Section 7. Our results depend on $\Sigma_0$ via
the constants $(\sigma_0, \Lambda_{\min})$, on the further constants $\gamma_0 := (\alpha, \tilde\alpha, \eta_0, \eta_1)$ used in the
conditions, and on $\alpha_0$, where $1 - \alpha_0$ is the confidence level of the statement.
We assume that we can take $\gamma_0$ sufficiently small. Moreover, we state the
results with $\alpha_0 := (4/p) \wedge .05$ to avoid digressions concerning the confidence
level. Explicit expressions can be found in Theorem 7.4, where we simplified
the situation by assuming $n$ is sufficiently large and $\log p / n$ is sufficiently small.

With the notation $z = O(1)$ we mean that $z$ can be bounded by a constant
depending only on $\sigma_0$ and $\Lambda_{\min}$. Moreover, with $z \asymp 1$ we mean $z = O(1)$
and $1/z = O(1)$. Furthermore, the Frobenius norm is defined as
$\|B\|_F := (\sum_{j=1}^p \sum_{k=1}^p |\beta_{k,j}|^2)^{1/2}$ for a $p \times p$ matrix $B$ with elements $(\beta_{k,j})$.

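Theorem 3.1 Assume Conditions 3.1, 3.2, 3.3, 3.4 and 3.5, with $\gamma_0 :=
(\alpha, \tilde\alpha, \eta_0, \eta_1)$ sufficiently small, but allowing $1/\|\gamma_0\|_1 = O(1)$. Let $1 - \alpha_0$ be
the confidence level, with $\alpha_0 := (4/p) \wedge .05$. Then for a choice

$$\lambda^2 \asymp \frac{\log p}{n} \left( \frac{p}{s_0} \vee 1 \right)$$

it holds that with probability at least $1 - \alpha_0$,

$$\|\hat B - \tilde B_0\|_F^2 + \|\hat\Omega - \tilde\Omega_0\|_F^2 = O(\lambda^2 s_0),$$

where $(\tilde B_0, \tilde\Omega_0)$ are defined in Section 2.2, and

$$\hat s \asymp \tilde s \asymp s_0.$$

The proof is given in Section 7. Theorem 7.4 gives some explicit bounds.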

Remark 3.1 If the beta-min condition (Condition 3.5) holds with $\eta_1 = 0$ and
with very small values for $\eta_0^2 := \eta_0^2(\pi)$, namely of order $1/s(\pi)$, then one obtains
the screening property: all edges in $(\tilde B_0, \tilde\Omega_0)$ are then with large probability also
present in $(\hat B, \hat\Omega)$.

Moreover, by taking $\lambda^2 := \lambda^2(s_0)$ very large (of order $s_0 \log p / n$), one can obtain
with large probability that $\hat s \le s_0$. In other words, by imposing a strong beta-min
condition, which is severe if $s_0$ is large, one recovers with high probability the
edges of the minimal-edge I-MAP exactly. However, in Theorem 3.1 we do not
use such values for $\eta_0$ and $\lambda$, but instead $\lambda^2 \asymp \log p / n$ (when $p = O(s_0)$) and
$\eta_0 \asymp 1$. Thus, we generally do not recover the true edges. This is the price for
dealing with a large $p$ situation and an $s_0$ possibly growing in $n$. Such problems
do not show up in asymptotics with $p$ fixed.

Remark 3.2 To avoid technical digressions in our proofs, we assume a
Gaussian distribution for the observations where zero correlations mean
independence. We use in Lemma 7.4 that if for some $\epsilon_j$, $E\epsilon_j = 0$, then also the
conditional expectation of $\epsilon_j$ given variables $X_k$ that are uncorrelated with $\epsilon_j$
is zero. In the non-Gaussian case, this is no longer true. However, one can
still derive similar results, along a line of proof that does not use
conditioning but instead concentration inequalities for averages of products of random
variables (empirical covariances). This means that our results go through for
observations which are sub-Gaussian. The proofs then rely on concentration
inequalities of Bernstein-type. As generally the observations are not bounded,
one cannot directly apply elegant concentration inequalities from the literature
such as Bousquet's inequality [Bousquet, 2003]. For the unbounded case, there
are however the results in van de Geer and Lederer which can be used.



4 A discussion of the conditions 
4.1 Bounds for the noise variances 

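For all $\pi$ and $j$ and for any $\beta_j$ with $\beta_{j,j} = 0$, we have $\|X_j - X\beta_j\| = \|X\tilde\beta_j\|$,
where $\tilde\beta_{k,j} = -\beta_{k,j}$ for $k \neq j$ and $\tilde\beta_{j,j} = 1$. Since $\|\tilde\beta_j\|_2 \ge 1$, it follows that for any $\pi$ and $j$,

$$|\omega_j^0(\pi)|^2 = \|X_j - X\beta_j^0(\pi)\|^2 \ge \Lambda_{\min}^2,$$

with $\Lambda_{\min}^2$ the smallest eigenvalue of $\Sigma_0$. Moreover, clearly $|\omega_j^0(\pi)|^2 \le \sigma_j^2$.
Hence, Conditions 3.1 and 3.2 imply that for all $\pi$ and $j$

$$0 < \Lambda_{\min}^2 \le |\omega_j^0(\pi)|^2 \le \sigma_0^2.$$

Furthermore, $\Lambda_{\min}^2 > 0$ is implied by

$$\min_j |\omega_j^0|^2 > 0, \quad (7)$$

since $\det(\Sigma_0) = \det(\Omega_0) = \prod_{j=1}^p |\omega_j^0|^2$. Thus, Condition 3.2 is equivalent to (7).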
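These bounds and the determinant identity can be checked numerically for every order of a small example. In the sketch below, the randomly generated covariance matrix, the helper name `residual_variance` and the tolerance `1e-12` are assumptions; the residual variance of $X_j$ given the variables later in the order plays the role of $|\omega_j^0(\pi)|^2$, with 0-based indices.

```python
from itertools import permutations

import numpy as np


def residual_variance(Sigma, j, S):
    """Variance of X_j after projecting on X_S (Sigma[j, j] if S is empty)."""
    S = list(S)
    if not S:
        return Sigma[j, j]
    coef = np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[S, j])
    return Sigma[j, j] - Sigma[j, S] @ coef


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
Sigma0 = A @ A.T + 0.5 * np.eye(4)          # hypothetical non-singular covariance
lam_min2 = np.linalg.eigvalsh(Sigma0).min() # smallest eigenvalue of Sigma0

bounds_hold, det_matches = True, True
for pi in permutations(range(4)):
    # omega2[j] is the error variance of node j under the order pi
    omega2 = {j: residual_variance(Sigma0, j, pi[pos + 1:]) for pos, j in enumerate(pi)}
    bounds_hold &= all(
        lam_min2 - 1e-12 <= omega2[j] <= Sigma0[j, j] + 1e-12 for j in omega2
    )
    det_matches &= bool(np.isclose(np.prod(list(omega2.values())), np.linalg.det(Sigma0)))
```

Both flags remain true over all 24 orders, in line with the chain of inequalities and with $\det(\Sigma_0)$ being the product of the error variances.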



4.2 Overfitting 

Condition 3.3 will ensure that the penalized minus log-likelihood cannot become
minus infinity. If $n$ or more edges are allowed at a node, say at node $j$, the
estimator will overfit the data at this node, giving a residual variance $\hat\omega_j^2 = 0$.
The penalized minus log-likelihood is proportional to $\sum_{j=1}^p \log \hat\omega_j^2 + \lambda^2 \hat s$, which
will be $-\infty$ if one allows that $\hat\omega_j$ vanishes. Note that the penalty as such does
not prevent this type of overfitting. Therefore, we need a restriction on the
class of possible DAGs, and Condition 3.3 serves this purpose. We will show
in Lemma 7.5 that Conditions 3.1, 3.2 and 3.3 imply that for an appropriate
constant $K_0 > 0$, it holds for all $j$ that $|\hat\omega_j|^2 \ge 1/K_0$ with large probability.
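The overfitting effect is easy to reproduce. This sketch uses hypothetical sizes ($n = 6$ observations, $p = 8$ independent standard normal variables) and a helper `residual_variance_hat` introduced only for illustration; it compares the fitted residual variance at one node with 2 parents and with $n$ parents.

```python
import numpy as np


def residual_variance_hat(X, j, parents):
    """Least-squares residual variance of column j regressed on `parents`."""
    y = X[:, j]
    if len(parents) == 0:
        return float(np.mean(y ** 2))
    Z = X[:, parents]
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(np.mean((y - Z @ coef) ** 2))


rng = np.random.default_rng(1)
n, p = 6, 8
X = rng.normal(size=(n, p))                 # pure noise: no true edges at all

few = residual_variance_hat(X, 0, [1, 2])                 # 2 parents
many = residual_variance_hat(X, 0, [1, 2, 3, 4, 5, 6])    # n = 6 parents
```

With $n$ parents the regression interpolates the data, so `many` is zero up to rounding and its logarithm diverges to $-\infty$, while the penalty $\lambda^2 \hat s$ stays finite. Bounding the number of parents by $\alpha n / \log p$ rules this out.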



4.3 The beta-min condition 

One may circumvent the beta-min condition if one allows for edges with weights
below some noise level $\lambda_*$ to be set to zero. Here, $\lambda_* := \sqrt{\log p / n}/\eta_0^*$ for some
suitable $\eta_0^* > 0$. Instead of trying to estimate the true DAG $(B_0, \Omega_0)$, one now
aims at estimating its best sparse approximation $(B_0^*, \Omega_0^*)$, which is defined as
follows. Let for any DAG $(B, \Omega)$, and for $\Theta = \Theta(B, \Omega)$, the weights $B_\Theta(\pi)$
be obtained by doing the Gram-Schmidt orthogonalization for $\|\cdot\|_\Sigma$, where
$\Sigma = \Theta^{-1}$ and $\|X\beta\|_\Sigma^2 := \beta^T \Sigma \beta$, $\beta \in \mathbb{R}^p$. Thus $B_\Theta(\pi)$ is lower-diagonal after
the permutation $\pi$ of its rows and columns, and for appropriate $\Omega_\Theta(\pi)$, the
DAG $(B_\Theta(\pi), \Omega_\Theta(\pi))$ satisfies

$$\Theta = \Theta(B_\Theta(\pi), \Omega_\Theta(\pi)).$$

Let $s_\Theta(\pi) := s_{B_\Theta(\pi)}$ be the number of edges of $B_\Theta(\pi)$. Connecting this with
our previous notation, we note that

$$B_{\Theta_0}(\pi) = B_0(\pi), \quad \Omega_{\Theta_0}(\pi) = \Omega_0(\pi), \quad s_{\Theta_0}(\pi) = s(\pi).$$

Let now for some constant $\eta_0^* > 0$,

$$s_\Theta^*(\pi) := \#\{(k, j) :\ |\beta_{\Theta,k,j}(\pi)| > \sqrt{\log p / n}\,(\sqrt{p/s_\Theta(\pi)} \vee 1)/\eta_0^*\}.$$

We then take

$$\Theta_0^* := \arg\min\{ l(\Theta) :\ \Theta = \Theta(B, \Omega) \text{ a DAG},\ s_\Theta^*(\pi) \ge (1 - \eta_1^*) s_\Theta(\pi),\ \forall \pi \},$$

where $0 \le \eta_1^* < 1$ and $l(\Theta) = \mathrm{trace}(\Theta \Sigma_0) - \log\det(\Theta) = E\, l_n(\Theta)$ is the
theoretical counterpart of the minus log-likelihood. (Note that $\Theta_0 := \Sigma_0^{-1}$ is the overall
minimizer of $l(\Theta)$.) We let $(B_0^*, \Omega_0^*)$ be a solution of

$$\Theta_0^* = \Theta(B_0^*, \Omega_0^*)$$

with the minimum number of edges. With constants $\eta_0^*$ and $\eta_1^*$ sufficiently small,
one may replace $\Theta_0 = \Theta(B_0, \Omega_0)$ by $\Theta_0^* = \Theta(B_0^*, \Omega_0^*)$ in the analysis. In this
way, one can avoid the beta-min condition, provided that the bias term that
will now appear in the bounds is small enough.



4.3.1 The beta-min condition and the number of edges 



We further note that Conditions 3.1, 3.2 and 3.5 imply Condition 3.4 with



ALn(l - ^l) 



This is because for all j. 



rjQ n ■' 



4.4 The high sparsity regime where $s_0 \ll p$

The reason why we see a term $p/s_0 \vee 1$ appearing in the tuning parameter
(see Theorem 3.1) and in the beta-min condition (Condition 3.5) is the
estimation of the $p$ unknown variances, which gives a term of order $p \log p / n$ in
our bounds for the squared Frobenius norm. If $s_0 \ll p$, the true DAG has many
disconnected components, and in fact it then has many isolated points. The
variables in one component are uncorrelated with those in another component.
We see this in the zeroes in the matrix $\Sigma_0$. The connected components and
isolated points are easily detected by assuming that non-zero correlations
are at least $\sqrt{\log p / n}/\eta_c$ in absolute value for an appropriate (sufficiently small)
constant $\eta_c$. Then we can do the analysis connected component by connected
component. To summarize, the situation $p = O(s_0)$ appears to be the most
interesting. Alternatively, when the noise variances are known up to a
scalar (for example if it is known that all noise variances are equal), we need
not estimate these variances anymore and the term of order $p \log p / n$ does not
appear in the bounds, provided an identifiability condition on the noise variances
holds and $p$ is sufficiently smaller than $n/\log n$. This will be shown in the next
section.



5 The case of equal variances 



Suppose that the noise variances 

are known up to a multiplicative scalar. To simplify the exposi- 
tion, let us assume that 

c^o = . . . = c^o = 1. 



p 



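The $\ell_0$-penalized maximum likelihood estimator now becomes

$$\hat{B} := \arg\min \Bigl\{ \mathrm{trace}\bigl( (I - B)(I - B)^T \hat{\Sigma}_n \bigr) + \lambda^2 s_B : \ (B, I) \text{ a DAG}, \ B \in \mathcal{B} \Bigr\}, \qquad (9)$$

where $\mathcal{B}$ is as in (6).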
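To make the objective in (9) concrete, here is a minimal brute-force sketch for a very small number of nodes. It rests on the identity trace((I - B)(I - B)^T Sigma_n) = sum_j ||X_j - X beta_j||_n^2, so the criterion is the total residual variance plus lambda^2 times the number of edges. The function names, the three-node chain, and the value lam_sq = 0.05 are illustrative assumptions; the search ignores any further restriction that the class in (6) may impose, and it enumerates all orderings, which is feasible only for tiny p.

```python
import itertools
import random


def gram_matrix(data):
    """Uncentered second-moment matrix X^T X / n (variables are assumed mean zero)."""
    n, p = len(data), len(data[0])
    return [[sum(row[j] * row[k] for row in data) / n for k in range(p)]
            for j in range(p)]


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with pivoting."""
    m = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * m
    for i in reversed(range(m)):
        x[i] = (aug[i][m] - sum(aug[i][c] * x[c] for c in range(i + 1, m))) / aug[i][i]
    return x


def residual_variance(sigma, j, parents):
    """Least-squares value of ||X_j - X beta_j||_n^2 when node j is regressed on `parents`."""
    if not parents:
        return sigma[j][j]
    a = [[sigma[k][l] for l in parents] for k in parents]
    b = [sigma[k][j] for k in parents]
    coef = solve(a, b)
    return sigma[j][j] - sum(c * bk for c, bk in zip(coef, b))


def l0_equal_variance_dag(data, lam_sq):
    """Exhaustive search minimising sum_j residual_j + lam_sq * (number of edges)."""
    sigma = gram_matrix(data)
    p = len(sigma)
    best_value, best_edges = float("inf"), None
    for order in itertools.permutations(range(p)):
        value, edges = 0.0, []
        for pos, j in enumerate(order):
            earlier = order[:pos]
            # The criterion decomposes over nodes, so each node picks its own best parent set.
            candidates = [s for r in range(len(earlier) + 1)
                          for s in itertools.combinations(earlier, r)]
            scored = [(residual_variance(sigma, j, list(s)) + lam_sq * len(s), s)
                      for s in candidates]
            score, subset = min(scored, key=lambda pair: pair[0])
            value += score
            edges += [(k, j) for k in subset]  # (k, j) encodes the edge k -> j
        if value < best_value:
            best_value, best_edges = value, sorted(edges)
    return best_edges, best_value


def simulate_chain(n, seed=1):
    """Hypothetical toy model X0 -> X1 -> X2 with unit noise variances."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.gauss(0.0, 1.0)
        x1 = 0.8 * x0 + rng.gauss(0.0, 1.0)
        x2 = 0.8 * x1 + rng.gauss(0.0, 1.0)
        rows.append([x0, x1, x2])
    return rows


edges, value = l0_equal_variance_dag(simulate_chain(3000), lam_sq=0.05)
print(edges, round(value, 3))
```

In this toy model the reversed chain has the same number of edges but a larger total residual variance, which is how the equal-variance criterion can separate orderings that are Markov equivalent.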

The main Theorem 3.1 as well as the remarks in Section 3.1 apply to the
estimator (9) as well, assuming exactly the same Conditions 3.1 - 3.5.

For the case where $p = O(n/\log(n))$ is sufficiently small, we obtain consistent
estimation of the true underlying DAG and we gain in comparison to the
main Theorem 3.1 by excluding the additional factor $(p/s_0 \vee 1)$. We make the
following assumptions.

Condition 5.1 There exists a constant $\eta_\omega > 0$ such that for all $\Omega_0(\pi) \neq I$,

$$\frac{1}{p} \sum_{j=1}^p \bigl( |\omega_j^0(\pi)|^2 - 1 \bigr)^2 > 1/\eta_\omega.$$

Condition 5.2 There exists a constant $\alpha_*$ such that

$$p \le \alpha_* n/\log n.$$

We call Condition 5.1 the "omega-min" condition. It leads to identification of
the DAG with equal variances. Condition 5.2 ensures that the rate of convergence
is fast enough that eventually we choose the right permutation.
Note that it implies Conditions 3.3 and 3.4 with $\alpha = \tilde{\alpha} = \alpha_*$.

Let $\pi_0$ be defined by $B_0(\pi_0) = B_0$. Since $B_0$ is identifiable from the observational
distribution $\mathcal{N}(0, \Sigma_0)$ [Peters and Bühlmann, 2012], see also Section
2.1.1, $\pi_0$ corresponds to the unique true ordering of the variables.



Theorem 5.1 Assume Conditions 3.1 and 3.2 and Conditions 5.1 and 5.2.
Let $\alpha_0 := (4/p) \wedge .05$. Then for $\gamma_* := (\alpha_*, \eta_\omega)$ suitably small, but allowing
$1/\|\gamma_*\|_1 = O(1)$, and for $\lambda^2 \asymp \log p/n$, it holds with probability at least
$1 - \alpha_0$, that $\hat{\pi} = \pi_0$, and

$$\|\hat{B} - B_0\|_F^2 + \lambda^2 \hat{s} = O(\lambda^2 s_0).$$

The proof is given in Subsection 7.5.

Thus, we find $\hat{s} = O(s_0)$, but we do not show $\hat{s} \asymp s_0$. To establish the latter,
one again needs a beta-min condition, but this time only on the DAG $(B_0, I)$,
and not on any of the other representations $(B_0(\pi), \Omega_0(\pi))$ with $\pi \neq \pi_0$. This
is a much simplified and weaker assumption than in Condition 3.5. Furthermore,
choosing $\lambda^2 \asymp s_0 \log p/n$ sufficiently large, exact edge recovery follows by
the beta-min condition for the true DAG $(B_0, I)$, that is, the condition that
$\min\{|\beta^0_{k,j}| : \beta^0_{k,j} \neq 0\} > s_0 \sqrt{\log p/n}/\eta_\beta$ for some sufficiently small $\eta_\beta > 0$.



6 Conclusions 

We establish the first results for the $\ell_0$-penalized MLE for estimation of the
minimal-edge I-MAP (the smallest DAG which can generate the data-generating
distribution) in the high-dimensional sparse setting. Thereby, we avoid the
strong-faithfulness condition (2) or its restricted version (3): the latter is
required for consistency of the PC-algorithm [Spirtes et al., 2000]. The
strong-faithfulness condition is typically very restrictive [Uhler et al., 2012] and hence,
our results contribute to relaxing such very restrictive assumptions.

Our main assumption is Condition 3.5 (which implies the sparsity Condition
3.4, see Section 4.3.1): Example 2.1 fulfills it, even if $p \gg n$. The noise level is
of the order $\sqrt{\log(p)/n \, (p/s_0 \vee 1)}$: the additional factor $(p/s_0 \vee 1)$ occurs due
to estimation of $p$ variances in $\Omega_0$. However, the interesting scenario is
the case where $s_0 \ge \mathrm{const.}\, p$, since $s_0 \ll p$ corresponds to a DAG where most nodes
are isolated, having no edges to other nodes; thus, for $s_0 \ge \mathrm{const.}\, p$, we obtain
the usual noise level of the order $\sqrt{\log(p)/n}$, as in high-dimensional regression
problems.

For the equal variance case with $p = O(n/\log(n))$ sufficiently small, our result
in Theorem 5.1 (and its comment below) is most clear in that we essentially only
require the beta-min Condition 3.5 for the true DAG $B_0$, and the identifiability
Condition 5.1 for the error variances: we can then recover the true underlying
unique DAG $B_0$. Thus, we have identified an important class of models where
estimation of the order of variables and the true underlying DAG is possible
without requiring the very restrictive strong-faithfulness condition (2) or (3).



7 Proofs 

7.1 Bounds on a subset of the probability space 

We present some explicit bounds assuming we are on a set of the form $\cap_{k=0}^3 \mathcal{T}_k$,
where the sets $\mathcal{T}_k$ are defined below. Then we show in Subsections 7.3.1, 7.3.2
and 7.3.3 that each $\mathcal{T}_k$, $k = 0, \ldots, 3$ has large probability for an appropriate
choice of the constants and of the parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ involved in the
definition of these sets. In fact, we will show that one can take



n 



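Let for some constant $K_0 > 0$,

$$\mathcal{T}_0 := \{ \hat{\omega}_j^2 \ge 1/K_0^2 \ \forall\, j \}.$$

Let us write $X_k \perp \epsilon_j$ if $X_k$ and $\epsilon_j$ are independent. For all $\pi$ and $j$, define
$\epsilon_j(\pi) = X_j - X\beta_j^0(\pi)$, and $\mathcal{B}_j(\pi) := \{\beta_j : X_k \perp \epsilon_j(\pi), \ \forall\, \beta_{k,j} \neq 0\}$. Moreover,
let $\mathcal{B}(\pi) := \{B = (\beta_1, \ldots, \beta_p) \in \mathcal{B} : \beta_j \in \mathcal{B}_j(\pi) \ \forall\, j\}$. For some $\delta_1 > 0$ and
some $\lambda_1 > 0$, write

$$\mathcal{T}_1 := \Bigl\{ 2 \sum_{j=1}^p |\epsilon_j^T(\pi) X(\beta_j - \beta_j^0(\pi))|/n \le \delta_1 \sum_{j=1}^p \|X(\beta_j - \beta_j^0(\pi))\|_n^2 + \lambda_1^2 (s_B + s(\pi))/\delta_1, \ \forall\, B = (\beta_1, \ldots, \beta_p) \in \mathcal{B}(\pi), \ \forall\, \pi \Bigr\}.$$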

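We let for some $\lambda_2 > 0$,

$$\mathcal{T}_2 := \Bigl\{ \sum_{j=1}^p \Bigl( \frac{\|\epsilon_j(\pi)\|_n^2 - |\omega_j^0(\pi)|^2}{|\omega_j^0(\pi)|^2} \Bigr)^2 \le 4\lambda_2^2 (p + s(\pi)), \ \forall\, \pi \Bigr\},$$

where we use the notation

$$\|v\|_n^2 := v^T v/n, \quad v \in \mathbb{R}^n.$$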
Finally, for some $\delta_3 > 0$ and some $\lambda_3 > 0$, let $\mathcal{T}_3$ be the set

$$\mathcal{T}_3 := \bigl\{ \|X\beta\|_n \ge (\delta_3 - \lambda_3 \sqrt{s_\beta}) \|\beta\|_2, \ \forall\, \beta \bigr\}.$$

Recall that $s_\beta := \#\{\beta_k \neq 0\}$.



Lemma 7.1 Define {BqjQq) := {BQ{Tt),Qo{Tf)) and s := s^^^. Assume that 
Condition \3.1\ holds. Suppose we are on hI^qTa; with < 6i < 1/Kq and 
< 62 < l/(2i^'QCjQ). Take the tuning parameter > Xf/Si + X2/S2- Then 



Si 62 J S2 Si 

Proof. Let e := e(7r). We apply the Basic Inequality 

inm + x^s<ueo) + xho, 

or equivalently 

p + t logc^l + A^l < f M| + t log + A^.o. 

j=i j=i I J I j=i 

which gives, using log(det(Eo)) = Z]j=i log = Z^^=i log I^^^P, 

Since t!)| > l/i^g (since we are on To) and I'^jP < ctq (by Condition 13. ip . we 
know that 

lr)0|2 



But then, using

$$\log(1 + x) \le x - \frac{x^2}{2(1 + c)^2}, \quad -1 < x \le c,$$

we get



We plug this back into the Basic Inequality to get 

P -2 |~0|2 , /~2 |,~,0|2\2 P /II ||2 \ 

Rewrite this to 
j j=i ^ I j I j=i



We now apply 



P /ll?.l|2 lr,0|2^ 



E 



~,0|2 



I,:,. 1 2 



\UJ 



0|2 



+ E 



0|2\ /|r,0|2 



But, by the Cauchy-Schwarz inequality and using that we are on $\mathcal{T}_2$,



E 



■^J lln 



\0J 



,0|2 



^(E 



J ll n - l^j 



0|2\ 2 



1/2 



J I J 



1/2 



<2^(p + s)Ai|^ 



1/2 



-I 



< 



(p + s)Ai 



+<^2i: 



Invoking trace(GoSn) = trace(GoSn)) that is 



P "112 
'^3 Wn 



E 



|a;0|2 



^ ll?.||2 

E lFj lln 



and using that we are on $\mathcal{T}_1$, we see that



p / - 2 ir,oi2\ 2 



\UJ-i 



,9 An(i» + s) A?s 

62 Ox 
7.2 Exploiting the beta-min condition

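Lemma 7.2 Let $\tilde{s} = s_{B_0}$ be the number of edges of $B_0$ and $\hat{s} = s_{\hat{B}}$ be the
number of edges of $\hat{B}$. Suppose that for some $\lambda$,

$$\|\hat{B} - B_0\|_F \le \lambda \sqrt{\tilde{s}},$$

and that for some constants $0 < \eta_1 < 1$ and $0 < \eta_2^2 < 1 - \eta_1$,

$$\#\{|\beta^0_{j,k}| > \lambda/\eta_2\} \ge (1 - \eta_1)\tilde{s}.$$

Then $\hat{s} \ge (1 - \eta_1 - \eta_2^2)\tilde{s}$.

Proof. Let

$$\mathcal{N} := \{(k,j) : |\beta^0_{k,j}| > \lambda/\eta_2\}, \quad \mathcal{M} := \{(k,j) : |\hat{\beta}_{k,j} - \beta^0_{k,j}| \ge \lambda/\eta_2\}.$$

Then for $(k,j) \in \mathcal{N} \cap \mathcal{M}^c$ it holds that

$$|\hat{\beta}_{k,j}| \ge |\beta^0_{k,j}| - |\hat{\beta}_{k,j} - \beta^0_{k,j}| > 0,$$

so that $\hat{s} \ge |\mathcal{N} \cap \mathcal{M}^c|$. Since $\|\hat{B} - B_0\|_F \le \lambda\sqrt{\tilde{s}}$, we must have

$$\lambda^2 \tilde{s} \ge \sum_{(k,j) \in \mathcal{N} \cap \mathcal{M}} |\hat{\beta}_{k,j} - \beta^0_{k,j}|^2 \ge |\mathcal{N} \cap \mathcal{M}| \, \lambda^2/\eta_2^2.$$

Hence,

$$|\mathcal{N} \cap \mathcal{M}| \le \eta_2^2 \tilde{s}.$$

This gives

$$|\mathcal{N} \cap \mathcal{M}^c| = |\mathcal{N}| - |\mathcal{N} \cap \mathcal{M}| \ge (1 - \eta_1)\tilde{s} - \eta_2^2 \tilde{s} = (1 - \eta_1 - \eta_2^2)\tilde{s}.$$

□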
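The counting argument of Lemma 7.2 can be checked on a small hand-made example. The sketch below is illustrative only: the function name and the numerical values are hypothetical, the edge weights are stored as flat lists, and eta1 is taken as the smallest value for which the beta-min assumption holds.

```python
import math


def lemma_7_2_check(beta0, beta_hat, lam, eta2):
    """Evaluate the assumptions and the conclusion of Lemma 7.2.

    Returns (assumptions_hold, s_hat, lower_bound), where lower_bound equals
    (1 - eta1 - eta2^2) * s_tilde.
    """
    s_tilde = sum(1 for b in beta0 if b != 0.0)      # edges of B_0
    s_hat = sum(1 for b in beta_hat if b != 0.0)     # edges of the estimate
    frobenius = math.sqrt(sum((bh - b0) ** 2 for bh, b0 in zip(beta_hat, beta0)))
    large = sum(1 for b in beta0 if abs(b) > lam / eta2)
    eta1 = 1.0 - large / s_tilde
    assumptions_hold = (frobenius <= lam * math.sqrt(s_tilde)
                        and 0.0 < eta1 < 1.0
                        and 0.0 < eta2 ** 2 < 1.0 - eta1)
    lower_bound = (1.0 - eta1 - eta2 ** 2) * s_tilde
    return assumptions_hold, s_hat, lower_bound


# Hypothetical example: three large weights, one weight below the threshold lam / eta2.
beta0 = [1.0, 1.0, 1.0, 0.05, 0.0, 0.0]
beta_hat = [0.8, 1.0, 1.2, 0.0, 0.0, 0.0]
ok, s_hat, bound = lemma_7_2_check(beta0, beta_hat, lam=0.2, eta2=0.5)
print(ok, s_hat, bound)
```

The small weight 0.05 is missed by the estimate, yet the number of estimated edges stays above the lower bound, which is the content of the lemma.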

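Lemma 7.3 Suppose that for some $\delta_B > 0$, $\delta_s > 0$, $\lambda_0 > 0$ and $\lambda$ one has

$$\delta_B \|\hat{B} - B_0\|_F^2 + \lambda^2 \delta_s \hat{s} \le \lambda^2 s_0 + \lambda_0^2 \tilde{s},$$

where $\tilde{s} \ge s_0$. Let $\lambda_*^2 \delta_B \ge \lambda^2 + \lambda_0^2$ and assume that

$$\#\{|\beta^0_{j,k}| > \lambda_*/\eta_2\} \ge (1 - \eta_1)\tilde{s}.$$

Then

$$\delta_B \|\hat{B} - B_0\|_F^2 + \Bigl( \lambda^2 \delta_s - \frac{\lambda_0^2}{1 - \eta_1 - \eta_2^2} \Bigr) \hat{s} \le \lambda^2 s_0,$$

and $\hat{s} \ge (1 - \eta_1 - \eta_2^2) s_0$.

Proof. Since $\tilde{s} \ge s_0$, we find that

$$\delta_B \|\hat{B} - B_0\|_F^2 \le (\lambda^2 + \lambda_0^2)\tilde{s} \le \delta_B \lambda_*^2 \tilde{s}.$$

This gives by Lemma 7.2

$$\hat{s} \ge (1 - \eta_1 - \eta_2^2)\tilde{s}.$$

But then

$$\delta_B \|\hat{B} - B_0\|_F^2 + \Bigl( \lambda^2 \delta_s - \frac{\lambda_0^2}{1 - \eta_1 - \eta_2^2} \Bigr) \hat{s} \le \lambda^2 s_0.$$

□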



7.3 The sets $\mathcal{T}_k$, $k = 0, 1, 2, 3$
7.3.1 The set $\mathcal{T}_1$

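Lemma 7.4 Let $Z$ be a fixed $n \times m$ matrix and $\epsilon_1, \ldots, \epsilon_n$ be independent
$\mathcal{N}(0, \sigma_0^2)$-distributed random variables. Then for all $t > 0$,

$$\mathbb{P}\Bigl( \sup_{\|Z\beta\|_n \le 1} |\epsilon^T Z\beta|/n \ge \sigma_0 \bigl( \sqrt{2m/n} + \sqrt{2t/n} \bigr) \Bigr) \le \exp[-t].$$

Proof. Assume without loss of generality that $Z^T Z/n = I$ and define $V_k :=
\epsilon^T Z_k/(\sigma_0 \sqrt{n})$. Then $V_1, \ldots, V_m$ are independent and $\mathcal{N}(0, 1)$-distributed. It
follows that for all $N \in \{2, 3, \ldots\}$,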



2\N 



{2N)l 
2^N\ 



< Nl. 



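But then by Bernstein's inequality (see [Bennett, 1962]), for all $t > 0$,

$$\mathbb{P}\Bigl( \sum_{k=1}^m (V_k^2 - \mathbb{E}V_k^2) \ge 2\sqrt{tm} + 2t \Bigr) \le \exp[-t].$$

Now use that $\sum_{k=1}^m \mathbb{E}V_k^2 = m$. We get

$$\mathbb{P}\Bigl( \sum_{k=1}^m V_k^2 \ge m + 2\sqrt{tm} + 2t \Bigr) \le \exp[-t].$$

But

$$m + 2\sqrt{tm} + 2t \le (\sqrt{2m} + \sqrt{2t})^2.$$

Furthermore,

$$\sup_{\|Z\beta\|_n \le 1} |\epsilon^T Z\beta|/n = \sigma_0 \Bigl( \sum_{k=1}^m V_k^2/n \Bigr)^{1/2}. \qquad (10)$$

□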
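The chi-square tail bound at the heart of this proof, namely that a sum of m squared standard normal variables exceeds m + 2 sqrt(tm) + 2t with probability at most exp[-t], can be probed by simulation. The sketch below is a Monte Carlo illustration, not a proof; the function name, the number of replications and the seed are hypothetical choices.

```python
import math
import random


def chi_square_tail_frequency(m, t, reps=20000, seed=0):
    """Monte Carlo frequency of {sum_k V_k^2 >= m + 2 sqrt(t m) + 2 t} for i.i.d. N(0, 1) V_k."""
    rng = random.Random(seed)
    threshold = m + 2.0 * math.sqrt(t * m) + 2.0 * t
    hits = 0
    for _ in range(reps):
        total = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))
        if total >= threshold:
            hits += 1
    return hits / reps


for m, t in [(5, 1.0), (10, 2.0)]:
    freq = chi_square_tail_frequency(m, t)
    # The empirical frequency should stay below the theoretical bound exp(-t).
    print(m, t, freq, math.exp(-t))
```

The bound is not tight for these values of m and t, so the simulated frequencies are expected to sit well below exp(-t).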
We are dealing now with the problem of uniformly controlling over all permutations
$\pi$. We consider the local structure at each node of a DAG $(B_0(\pi), \Omega_0(\pi))$
with $B_0(\pi) =: (\beta^0_{k,j}(\pi))$. Let $S_j(\pi)$ be the set of incoming edges at node
$j$. Given $S_j(\pi)$, the vector $X\beta_j^0(\pi)$ is the projection in $L_2(P)$ of $X_j$ on the
linear space spanned by $\{X_k\}_{k \in S_j(\pi)}$. Moreover, $\epsilon_j(\pi)$ is the anti-projection
$\epsilon_j(\pi) = X_j - X\beta_j^0(\pi)$. In other words (for $j$ fixed) if the parents $S_j(\pi)$ at node
$j$ are given, then the coefficients $\beta^0_{k,j}(\pi)$ and noise term $\epsilon_j(\pi)$ are given as well.
Also, the set of variables $X_k$ that are independent of $\epsilon_j(\pi)$ is then given. Recall
that $\mathcal{B}_j(\pi) := \{\beta_j : X_k \perp \epsilon_j(\pi), \ \forall\, \beta_{k,j} \neq 0\}$. Thus, for each fixed $j$, if $S_j(\pi)$ is
given then the local situation $(\epsilon_j(\pi), \beta_j^0(\pi), \mathcal{B}_j(\pi))$ at node $j$ is given.

Let $\Pi_j(m)$ be the collection of all permutations giving DAGs $(B_0(\pi), \Omega_0(\pi))$
with edges $(S_1(\pi), \ldots, S_p(\pi))$ with $|S_j(\pi)| = m$. If for some $m \in \{0, 1, \ldots, p\}$,
we know that $\pi \in \Pi_j(m)$, that is, we know that node $j$ has $m$ parents, then
there are at most $\binom{p}{m}$ possibilities for the local situation at node $j$.
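This count is what keeps a union bound over permutations affordable: since $\binom{p}{m} \le p^m$, an extra $m \log p$ in the exponent of a tail bound absorbs the number of local situations, and each further union over at most $p$ items costs one more $\log p$. The following sketch checks this arithmetic on the log scale. The function names and the tolerance are hypothetical, and the three steps mirror the way the bound is used in the proof of Theorem 7.1 as far as it can be read from the text.

```python
import math


def log_count_times_tail(p, m, t):
    """Log of C(p, m) * exp(-(t + m log p + 2 log p))."""
    return math.log(math.comb(p, m)) - (t + m * math.log(p) + 2.0 * math.log(p))


def union_bound_steps_hold(p, t, tol=1e-9):
    """Check, on the log scale, three successive union-bound steps for given p and t."""
    log_p = math.log(p)
    # Step 1: all local situations with m parents at a fixed node j.
    step1 = all(log_count_times_tail(p, m, t) <= -(t + 2.0 * log_p) + tol
                for m in range(p + 1))
    # Step 2: a union over at most p values of m multiplies the bound by p.
    step2 = log_p - (t + 2.0 * log_p) <= -(t + log_p) + tol
    # Step 3: a union over the p nodes j multiplies the bound by p once more.
    step3 = log_p - (t + log_p) <= -t + tol
    return step1 and step2 and step3


print(all(union_bound_steps_hold(p, t) for p in (2, 5, 20, 100) for t in (0.5, 1.0, 3.0)))
```

Steps 2 and 3 hold with equality, so the only slack in the argument comes from replacing the binomial coefficient by $p^m$.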

Theorem 7.1 Assume Condition 3.1. Then for all $t > 0$,



P max sup 2 |eJ(7r)(X(/3,- - /3°(7r))|/n - 5i ^ \\X{Pj - /30(^) 

V SGB(7r) j=i j = i 



\l 



^ Aa^{sB + H^)) ^ a^{t + 2logp){sB + s{7r)) 
~ n6i n6i 

< exp[— t]. 

Proof. Let Aj{7r) be the event 

AjiTT) := /3, e Bj{7r) : sup \eJ{7r){X{pj - 

L ||X(/3,-/30W)||„<l 



>ao{\l (^(^/^.- + . + Sj{7r) logp + 21ogp) 



n V n 

Then by Lemma 7.4, for all $t > 0$, $\pi$ and $j$,

$$\mathbb{P}(A_j(\pi)) \le \exp[-(t + s_j(\pi) \log p + 2 \log p)].$$

We now let $\pi$ vary over all permutations such that $|S_j(\pi)| = m$. We then get

$$\mathbb{P}\bigl( \cup_{\pi \in \Pi_j(m)} A_j(\pi) \bigr) \le \binom{p}{m} \exp[-(t + m \log p + 2 \log p)] \le \exp[-(t + 2 \log p)].$$

Next, we let $\pi$ vary over all permutations. We get

$$\mathbb{P}\bigl( \cup_\pi A_j(\pi) \bigr) \le p \max_m \mathbb{P}\bigl( \cup_{\pi \in \Pi_j(m)} A_j(\pi) \bigr) \le p \exp[-(t + 2 \log p)] \le \exp[-(t + \log p)].$$

Finally

$$\mathbb{P}\bigl( \cup_{j=1}^p \cup_\pi A_j(\pi) \bigr) \le p \max_j \mathbb{P}\bigl( \cup_\pi A_j(\pi) \bigr) \le p \exp[-(t + \log p)] \le \exp[-t].$$
Now, we use that for all 6i > 0, 

j=i ^ 

where s = Ylj=i Sj,s = 



7.3.2 The set $\mathcal{T}_2$

Theorem 7.2 Assume Condition 3.4. Then for all $t > 0$,

2 



^ /lle-(vr)||2-|cD0(vr)p 



> 



g^ P^+ (l + 8«)g(7r) logp + 2p log ^ g^ 4p(t^ 
< 2exp[-t]. 

Proof. Define 



+ log2p) 



|6,(^ )||^-|(I; °(^)P 



□ 



— im? — ■ 

Using the same argument as in (10), we see that for each $\pi$, and for all $t > 0$,



Define 



where 



p(|Z,(.)|>2(y|+^)) <2expH]. 

\Zj{n)\ 



Z,(7r) := 



2aj(7r) 



t + Sj(7r) logp + log(l +p) + logp 
n 

+ Sj{7r) logp + log(l +p) + logp 
It follows that

P I max max max 'Zij(TT) > 1 

\l<j<p 0<m<p TTgHj (m) 

< 2p{p + 1) exp[— (t + mlogp + log(l + p) + logp)] < 2exp[— t]. 

Invoking $\log(1 + p) \le 2 \log p$, we see that with probability at least $1 - 2\exp[-t]$,
it holds that for all permutations $\pi$ and all $j$,



^ ^ 2 1 * + Sj{Tr)logp + 2logp ^ t + 5j(7r)logy> + 21ogp 
which implies 

^ |Z (7r)p < + ^j(^)^ogy + ^^ogP I t + gj(7r)logp + 21ogp y 

j=i j=iV n J 

^ o(P^ + s(7r) logp + 2plogp\ ^ /4pt2 + 8 log^p + 'iplog^ p 



n 

Next, we insert that for all $j$, $s_j(\pi) \le \alpha n/(\log p)$, to find

$$\sum_{j=1}^p s_j^2(\pi) \log^2 p \le \sum_{j=1}^p (\alpha n/(\log p)) \, s_j(\pi) \log^2 p = \alpha s(\pi) n \log p.$$

We then arrive at 

^|Z (7r)|2 < g /" ^^ + (1 + logp + 2plogp \ ^ ^ / 4p(t^ + log^ p) 

□ 



7.3.3 The sets $\mathcal{T}_0$ and $\mathcal{T}_3$







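Theorem 7.3 Assume Condition 3.1 and Condition 3.2. For all $t > 0$, with
probability at least $1 - 2\exp[-t]$,

$$\|X\beta\|_n \ge \Bigl( 3\Lambda_{\min}/4 - \sqrt{\frac{2(t + \log p)}{n}} - 3\sigma_0 \sqrt{\frac{s_\beta \log p}{n}} \Bigr) \|\beta\|_2$$

uniformly in all $\beta \in \mathbb{R}^p$.

Proof.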



We follow here the arguments used in [Raskutti et al., 2010], which we slightly
adjust to the style of the present paper. They show that for 5'^ = 1/4 (in fact 
for = o(l) as n — )• cxd), and for all r > 



E inf 



\\Xp\\n>l-8'^-^cjo\l^r. 
=1 V n 
Hence, for all $1 \le m \le p$,



E inf \\Xp\\n > (1 - 5^)Amin " 3cTo 

Si3<m, P||2=l 



m logp 



n 



Apply the contraction inequality given in [Massart, 2003] to find that for all
$t > 0$,



P 

Thus 
P 



E inf \\X(3\\„ 

sp<m, ||/3||2=1 



inf „ \\X/3\\n 



sp<m, m\2 = l 



>\l-] <2exp[-t]. 



(l-(^3)Ainin-3(Jo 



m logp 



n 



inf \\XI3\\n 

sp<m, ||/3||2=1 



>\—\ < 2exp -i , 
V n ' 



and hence 



P 3 /3 : 



(1 - 6'^)Aram - ScTq 



3/3 logp 



n 



\\Bh - \\xm\. 



n 



□ 



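Lemma 7.5 Assume Conditions 3.1, 3.2, 3.3 and 3.4 and that

$$1/K_0 := 3\Lambda_{\min}/4 - \sqrt{\frac{2(t + \log p)}{n}} - 3\sigma_0 \sqrt{\alpha + \tilde{\alpha}} > 0.$$

Let for some $t > 0$,

$$\mathcal{T}_3 := \Bigl\{ \|X\beta\|_n \ge \Bigl( 3\Lambda_{\min}/4 - \sqrt{\frac{2(t + \log p)}{n}} - 3\sigma_0 \sqrt{\frac{s_\beta \log p}{n}} \Bigr) \|\beta\|_2, \ \forall\, \beta \Bigr\}.$$

Then $\mathbb{P}(\mathcal{T}_3) \ge 1 - 2\exp[-t]$ and one has on $\mathcal{T}_3$, for all $B = (\beta_1, \ldots, \beta_p) \in \mathcal{B}$
and all $\pi$ and all $j$,

$$\|X(\beta_j - \beta_j^0(\pi))\|_n \ge \|\beta_j - \beta_j^0(\pi)\|_2/K_0. \qquad (11)$$

Moreover,

$$\hat{\omega}_j^2 \ge 1/K_0^2.$$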



Proof. Theorem 7.3 states that $\mathbb{P}(\mathcal{T}_3) \ge 1 - 2\exp[-t]$. Result (11) follows
immediately, since $s_{\beta_j} + s_{\beta_j^0(\pi)} \le (\alpha + \tilde{\alpha}) n/\log p$. For the last result, we define

Pk,j •= for j and PJ^^ = 1. Then on Ts, 

^1 = ||X^7 11^ > 11/3- ||n/i^o'> 1/^0- 



□ 
7.4 Collecting the results



Lemma 7.6 Define (i?o,^o) ■= (-Bo(vr), Oo(7r))- Assume Conditions \3.1\ . \3.S^ 
[13 [X7t and\3.5l Suppose we are on n|^o7fc w^^t/i < 5i < l/i^Q and < 
< l/(2-f^o^o) '^3 ~ A3VcM-~ay^?V^og^ > 1/Kq > 0. Tafce the tuning 
parameter > Xf/Si + A|/(^2- 

Sw < -TP) 



5. < ( 1 



2 ._ {p/sQ + l)Ai Af 
Let>?5B := A^+Aq, andrjl := r/gA^n/ logp = r/o(A^ + Ao)(n/ logp)/(5|. Assume 

x\ - -r^^) ■■= x's, > 0. 

1 - r?i - y 
T/zen 

- ^oll^ + 6w\\^ - ^oWf + A^<5r,s < A^So, 
and s > {1 — rji — r/|)s > (1 — f?i — f?2)^0- 
Proof. This follows from Lemma 7.1 and Lemma 7.3.



□ 



Lemma 7.7 Assume Conditions\3Ji\3M\EE and[3^ with 



2(t + logp) 



3Amin/4 - ^ - 3(70 Va + a > 1/Ko > 0. 



Take 



n 



2 _ 4CT^(l + t + 21ogp)logp 
logp n 



logp nlogp / n 



, 9 9 log P 



and 



Then 



3 / 2(t + logp) 

(^3 := TAmin - V ■ 

4 V n 
$$\mathbb{P}(\cap_{k=0}^3 \mathcal{T}_k) \ge 1 - 4\exp[-t].$$

Proof. This follows from combining Theorem 7.1, Theorem 7.2 and Lemma 7.5.

□

Theorem 7.4 Assume Conditions 3.1, 3.2, 3.3, 3.4 and 3.5. Let us take $t =
\log p$, giving $\alpha_0 = 4/p$ (suppose $p$ is large). Take $n$ sufficiently large, and $\log p/n$
bounded. Let



ci := 96, c, := 3840, c = 4( ^^Z^" + ^^^^^^ + ^^"^ 



A4. ' A2. 

mm mi 

Some possible choices for the constants are 



\ mm mm / 



a = a = K 



Then 



We let 



288ag' ■ A„ 

. A*^ ■ A ■ 

- - 64^' - 

.i^, Af = 12agi^, Ai = 60i^, Ai = 9al^-^. 
n n n n 



y2 f jp/sQ + l)c2'7o _^ CiO-g ^ logp 



Ai.„ Ai 



n 



\2 _ f I (P/'SQ + l)c20-o ^ cio-g 32 logp 



AL„ ALJ At 



and 



V min min 

(p/so + 1)C2(T^ CifjgX 32 



2\ -1 



% = % C + -J + 



A^. A2. /A^. ■ 

mm mm/ mm 

Proof. This follows from using some bounds and exact choices in Lemma 7.6
and Lemma 7.7. In particular, with $\lambda^2 = c \log p/n$, we take $\lambda_*^2 = (\lambda^2 + \lambda_0^2)/\delta_B$.
With 7/1 = and 5rj = 1/2, the equation 

gives 

^1 ^ Ciag C2(T^ 



"A^ . cA^ . 

"-mm ''^ '^mi 

' ip/fo±}}c2^ Cia^\ f ^2f^, {P/S0 + 1)C2(T0 , Cia^\\~^ _ C 

Ail ^^JV~^'~^'[^ AtZ ^Ai" 
With

.4 „ „2 \ -1 



we have to solve for c 



2 -fr^^^ ^ ^^l^V 
mm mm / 



I rA2 rA'^ y I A2 / 2' 

V "'■"■min "'■'^min/ V ■'^min ^^min/ 



This yields 



\ mm mm/ 



4 — 2\ ^ciag caag 

mm mm 



□ 



7.5 Proof of Theorem 5.1

We investigate what happens on the set $\cap_{k=0}^3 \mathcal{T}_k$ defined in Subsection 7.1. The
results in Subsection 7.3 say that $\cap_{k=0}^3 \mathcal{T}_k$ has probability at least $1 - 4\exp[-t]$ for
a proper choice of the constants and parameters involved. Theorem 5.1 then
follows directly.

Lemma 7.8 Assume Condition 3.1, Condition 5.1 and Condition 5.2. Suppose
we are on $\cap_{k=0}^3 \mathcal{T}_k$, with



and 



63 - XsV^c^ > ^ > 0, 



Then vr = ttq and 



Proof. We have 



P_ J>^ J>_ ||~ ||2 



i=i i=i i=i ''^j' 

So we find 

j=i 3=1 j=i ^''^j' ^ 



We have 

P /I \ P II? I|2 l,~,0|2 
We know that



log(det(Eo)) = ^log|a;°|2 = ^log|a;°|2 = 0, 



since Iw^p = 1 for all j. Moreover 



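$$\log(1 + x) \le x - \frac{x^2}{2(1 + c)^2}, \quad -1 < x \le c.$$

So, since $|\omega_j^0|^2 \le \sigma_0^2$,

$$\log|\omega_j^0|^2 \le (|\omega_j^0|^2 - 1) - \frac{1}{2\sigma_0^4} (|\omega_j^0|^2 - 1)^2.$$

Hence

$$0 \le \sum_{j=1}^p (|\omega_j^0|^2 - 1) - \frac{1}{2\sigma_0^4} \sum_{j=1}^p (|\omega_j^0|^2 - 1)^2.$$

This gives

$$\sum_{j=1}^p (1 - |\omega_j^0|^2) \le -\frac{1}{2\sigma_0^4} \sum_{j=1}^p (|\omega_j^0|^2 - 1)^2.$$

Therefore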

i:im--/3°ii^+^x:(i^,T-i)^+A^^ 

i=i j=i ' j' 

< 2 j2 eJX0, - ^0)/n + A^.o + ^ + ^2 - l)^ 

i=i ^ i=i 

where we invoked that we are on the set $\mathcal{T}_2$. We find

This gives in a next step, using that we are on $\mathcal{T}_1$,

(1 - S,) E \\X0j - + - - 1)^ + 



j=i ^ ^ j=i 

Ai(p + 5) A|(£+s) 2, 

< 1 ^ \- A So- 

02 Oi 
Hence, using that we are on $\mathcal{T}_3$ and invoking Condition 5.2,

Ai(p + g) A?s 2, 

- 1 1" 1 r A So 

02 Ol 

where we use that s < and sq < (and also p < p^). Since (using again 
Condition 15. 2p p\ogp/n < a*, and 



7 = 1 ^ ^ 



2 

_ > p/r/^ if TT 7^ VTo, 



find that if vr 7^ ttq, 



1 \ p /2Ai A? o\ n 
'^2 — < ^ + + A^ a*- 



2(7^ J Vuj \ ^2 ^1 J *logp' 

which is in contradiction with Condition 5.1 and the further condition (12)
imposed in this lemma. So we must have $\hat{\pi} = \pi_0$, and thus $|\omega_j^0|^2 = 1$ for all $j$.
The result now follows from restarting the proof with $|\omega_j^0|^2 = 1$ for all $j$ plugged
in.

□ 
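The proof above, like that of Lemma 7.1, relies on the elementary inequality $\log(1 + x) \le x - x^2/(2(1 + c)^2)$ for $-1 < x \le c$. A quick numerical sanity check on a grid is sketched below. The function names, the grid size and the tested values of $c$ are hypothetical choices, and the check assumes $c \ge 0$; it illustrates the inequality but does not prove it.

```python
import math


def log_upper_bound(x, c):
    """Right-hand side x - x^2 / (2 (1 + c)^2) of the inequality."""
    return x - x * x / (2.0 * (1.0 + c) ** 2)


def max_violation(c, steps=10000):
    """Largest value of log(1 + x) minus the bound over a grid of x in (-1, c]."""
    worst = -math.inf
    for i in range(1, steps + 1):
        x = -1.0 + (c + 1.0) * i / steps  # grid point in (-1, c]
        worst = max(worst, math.log1p(x) - log_upper_bound(x, c))
    return worst


for c in (0.5, 1.0, 3.0):
    # A non-positive value (up to rounding) means no grid point violates the inequality.
    print(c, max_violation(c))
```

The gap closes at $x = 0$, where both sides vanish, which is why the quadratic term is exactly what survives in the bounds on the noise variances.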